Processor and method for executing load operation and store operation thereof

ABSTRACT

A processor and a method for executing load operation and store operation thereof are provided. The processor includes a data cache and a store buffer. When executing a store operation, if the address of the store operation is the same as the address of an existing entry in the store buffer, the data of the store operation is merged into the existing entry. When executing a load operation, if there is a memory dependency between an existing entry in the store buffer and the load operation, and the existing entry includes the complete data required by the load operation, the complete data is provided by the existing entry alone. If the existing entry does not include the complete data, the complete data is generated by assembling the existing entry and a corresponding entry in the data cache.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention generally relates to memory operations of processors, and more particularly, to a processor and a method for executing a load operation and a store operation of the processor.

2. Description of Related Art

Most current processors adopt an instruction pipeline architecture to increase performance. In order to reduce the time for obtaining data from a memory, such a processor typically includes a data cache for temporary storage of data that is read from the memory. The data cache is divided into a data RAM and a tag RAM. There are generally two types of memory operations, i.e., a load or read operation and a store or write operation. During the load operation, the data RAM and tag RAM can be read simultaneously. The read data is used directly if the result of the tag comparison is a cache hit and is discarded if the result of the tag comparison is a cache miss. During the store operation, on the other hand, the tag RAM must first be read to compare the tag with the store address. The data is stored in the data RAM only if the comparison result is a cache hit.

Due to the above difference, the time for executing the load operation is less than the time for executing the store operation. When a store operation is followed by a load operation, read/write competition may occur in the instruction pipeline, in which both the load operation and the store operation attempt to access the data RAM concurrently. If the load operation then waits until the store operation completes, the load operation stalls, which decreases the processing efficiency of the instruction pipeline.

To address the stall problem, U.S. Pat. No. 6,434,665 discloses a store buffer for temporary storage of parameters such as the address and data of a store operation. As such, in case a read/write competition occurs in the data cache, the load operation can be executed before the data stored in the store buffer is written into the data cache. However, this method can be used only when there is no memory dependency between the load operation and the store operation. That is, this method is suited only to the situation where the address to be read in the load operation does not overlap with the address to be written in the store operation. When there is such a memory dependency, the load operation must still wait until the store operation completes in order to read correct data and, therefore, the stall problem still exists.

To further solve the stall problem, U.S. Pat. No. 6,141,747 proposes another method. In this method, the data in the store buffer is directly forwarded to the load operation in case a read/write competition occurs and there is a memory dependency between the load operation and the store operation. As such, the load operation does not have to wait until the data is written into the data cache. In this method, the data is stored in the store buffer in words of multiple bytes. However, each piece of data is not necessarily a whole word or whole words. For example, the data may be half-word data, or only one byte of the data may be valid. If the data to be used in the load operation is distributed in multiple entries of the store buffer, a complex assembling mechanism is required to assemble the scattered data in the multiple entries to form the data to be forwarded to the load operation. If the store buffer cannot provide the complete data required by the load operation, the data parts in the store buffer need to be written into the data cache before the data can be read from the data cache in the load operation, which also causes a stall problem in the instruction pipeline.

SUMMARY OF THE INVENTION

Accordingly, the present invention is directed to a method for executing a store operation that can merge data at a same address into a store buffer to solve the foregoing problem caused by scattered data of the store buffer.

The present invention is also directed to a method for executing a load operation that can assemble data of the store buffer and a data cache and forward the assembled data to the load operation to reduce the waiting time of the load operation when the store buffer contains only a part of the data required by the load operation.

The present invention is also directed to a processor executing the store operation and the load operation using the above methods, which can solve the foregoing problems and increase the processing efficiency.

The present invention provides a method for executing a store operation. In this method, a store buffer is first provided. When executing a store operation, a new entry is added in the store buffer according to the store operation if the store buffer has no entry which has a same address as an address of the store operation. Data of the store operation is merged into an existing entry of the store buffer if the address of the store operation is the same as the address of the existing entry.

In addition, the present invention provides a method for executing a load operation. In this method, a data cache and a store buffer are first provided. When executing a load operation, data required by the load operation is read from the data cache if there is no memory dependency between all entries of the store buffer and the load operation. An existing entry of the store buffer provides complete data required by the load operation if there is a memory dependency between the existing entry and the load operation and the existing entry contains the complete data required by the load operation. The complete data required by the load operation is generated according to data of an existing entry of the store buffer and data of a corresponding entry of the data cache if there is a memory dependency between the existing entry and the load operation and the existing entry does not contain the complete data.

The present invention provides a processor including a data cache and a store buffer. The data cache stores data read from a memory. The store buffer is coupled to the data cache. The store buffer is used for temporary storage of an address and data of a store operation when a load operation and the store operation compete to access the data cache. The processor adds a new entry in the store buffer according to the store operation if the store buffer has no entry which has a same address as an address of the store operation. The processor merges data of the store operation into an existing entry of the store buffer if the address of the store operation is the same as the address of the existing entry.

According to one embodiment of the present invention, the new entry includes the address, a mask and the data of the store operation.

According to one embodiment of the present invention, when merging the data of the store operation into the existing entry, the processor generates a mask of the store operation according to the address and a data type of the store operation, generates a merged mask according to the mask of the store operation and a mask of the existing entry, generates merged data according to the mask and data of the store operation and data of the existing entry, and stores the merged mask and merged data into the existing entry.

According to one embodiment of the present invention, the merged mask is generated based on a logic operation on the mask of the store operation and the mask of the existing entry. Each bit of the mask of the store operation is a first preset value or a second preset value. A portion of the merged data that corresponds to the first preset value adopts the data of the store operation, and a portion of the merged data that corresponds to the second preset value adopts the data of the existing entry.

In addition, the present invention provides a processor including a data cache and a store buffer. The data cache stores data read from a memory. The store buffer is coupled to the data cache and is used for temporary storage of an address and data of a store operation when a load operation and the store operation compete to access the data cache. The processor reads data required by the load operation from the data cache if there is no memory dependency between all entries of the store buffer and the load operation. The processor reads complete data required by the load operation from an existing entry of the store buffer if there is a memory dependency between the existing entry and the load operation and the existing entry contains the complete data required by the load operation. The processor generates the complete data required by the load operation according to data of an existing entry of the store buffer and data of a corresponding entry of the data cache if there is a memory dependency between the existing entry and the load operation and the existing entry does not contain the complete data.

According to one embodiment of the present invention, the address of the existing entry of the store buffer is the same as the address of the corresponding entry of the data cache. The aforementioned complete data is generated based on the mask and the data of the existing entry and the data of the corresponding entry.

According to one embodiment of the present invention, each bit of the mask of the existing entry is a first preset value or a second preset value, a portion of the complete data that corresponds to the first preset value adopts the data of the existing entry, and a portion of the complete data that corresponds to the second preset value adopts the data of the corresponding entry.

In order to make the aforementioned and other features and advantages of the present invention more comprehensible, embodiments accompanied with figures are described in detail below.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates an architecture of a processor according to one embodiment of the present invention.

FIG. 2 is a flow chart of a store operation and a load operation according to one embodiment of the present invention.

FIG. 3 illustrates an internal data structure of a store buffer according to one embodiment of the present invention.

FIG. 4 illustrates a method for generating a memory operation mask according to one embodiment of the present invention.

FIG. 5 illustrates a mask operation during the merging of data into the store buffer according to one embodiment of the present invention.

FIG. 6 illustrates a data operation during the merging of data into the store buffer according to one embodiment of the present invention.

FIG. 7 illustrates an operation during the assembling of the store buffer and the data cache to generate the data required by the load operation according to one embodiment of the present invention.

DESCRIPTION OF THE EMBODIMENTS

FIG. 1 illustrates the architecture of a processor 100 according to one embodiment of the present invention. The processor 100 includes an address generation unit (AGU) 110, a data assemble unit 120, a store buffer 130, and a data cache 140. The AGU 110, data assemble unit 120, and store buffer 130 belong to the memory and execution stage of an instruction pipeline 105 and are coupled to each other via the instruction pipeline 105. The store buffer 130 is also coupled to the data cache 140.

The AGU 110 operates to generate addresses for the load operation and store operation. The data cache 140 stores data read from a memory (not shown). When the load operation and the store operation compete to access the data cache 140, the store buffer 130 temporarily stores the address and data of the store operation. The data assemble unit 120 may assemble the data from the store buffer 130 and data from the data cache 140, and the assembled data can be used by the load operation as described below in greater detail. The store buffer 130 may be used to address the read/write competition in the data cache 140 to increase the processing efficiency of the instruction pipeline 105. When the read/write competition occurs, the load operation is executed prior to the store operation.

FIG. 2 is a flow chart of a memory operation executed by the processor 100 of the present embodiment. Firstly, the AGU 110 at the front of the execution stage generates the address and data type (data type is described later) of each memory operation (step 205). Then, the processor 100 determines the type of the memory operation (step 210). If it is a store operation, the processor 100 compares the address of the store operation with an address of each entry of the store buffer 130 (step 215) to check if any entry in the store buffer 130 has a same address as the address of the store operation (step 220). If there is no same address, the processor 100 adds a new entry in the store buffer 130.

FIG. 3 illustrates a data structure stored in the store buffer 130 of the present embodiment. The data stored in the store buffer 130 has a length of one word. Each word has four bytes and is thirty-two bits long. Each row of the table of FIG. 3 represents an entry including five fields, i.e., address, mask, data, valid bit, and matched bit.

The address field of FIG. 3 records the address of the store operation, with the two least significant bits (LSB) removed to fit the data length of one word. The mask field of FIG. 3 records a mask of the store operation. Generation of the mask is shown in the table of FIG. 4. In FIG. 4, all numerals are binary and the two least significant bits of the store operation address are written into the address field of FIG. 4. The data type field of FIG. 4 records the data type of the store operation where “00” represents byte, “01” represents half word consisting of two bytes, and “11” represents word consisting of four bytes. The mask field of FIG. 4 records the mask generated according to the address and data type of the same row, which is also the value filled in the mask field of FIG. 3 when adding a new entry. Because the length of the data stored in each entry of the store buffer 130 is four bytes, the mask field of each entry is four bits long. The four bits of the mask field and the four bytes of the data field are in one-to-one correspondence with each other. If a bit of the mask field is “1”, the corresponding byte is valid data. In an alternative embodiment of the present invention, bit “0” of the mask field may be used to represent that the corresponding byte is valid data.
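As an illustration of the mask generation described above, the following is a minimal Python sketch. The table of FIG. 4 is not reproduced in the text, so the mapping here is an assumption: mask bit i is taken to correspond to the byte at offset i within the word, and accesses are assumed to be naturally aligned. The function name `make_mask` and the constant names are hypothetical.

```python
# Data type encodings quoted in the description: "00" byte, "01" half word, "11" word.
BYTE, HALF_WORD, WORD = 0b00, 0b01, 0b11

# Number of bytes touched by each data type.
SIZE_IN_BYTES = {BYTE: 1, HALF_WORD: 2, WORD: 4}


def make_mask(address: int, data_type: int) -> int:
    """Return a 4-bit byte mask for one memory operation.

    Assumption (not confirmed by the text): mask bit i marks the byte at
    offset i within the word, and accesses are naturally aligned.
    """
    offset = address & 0b11              # two least significant address bits
    size = SIZE_IN_BYTES[data_type]
    if offset % size != 0:
        raise ValueError("unaligned access is not covered by this sketch")
    return ((1 << size) - 1) << offset   # 'size' consecutive ones starting at 'offset'
```

Under these assumptions, a half-word access at byte offset 2 yields the mask 1100, and a word access yields 1111.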

The data field of FIG. 3 records the data to be written in the store operation. When the valid bit is set, the entry having the valid bit is valid. When the valid bit is cleared, the entry having the valid bit is invalid and can be overwritten with a new entry. The matched bit is used in the data merging step (step 240) as described later in greater detail. When adding a new entry, the processor 100 generates a mask according to the address and data type of the store operation, writes the address and mask of the store operation into corresponding fields of the entry, sets the valid bit, and clears the matched bit. In a later stage of the instruction pipeline 105, when the data of the store operation has been prepared, the processor 100 writes the data into the data field of the entry.
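The five-field entry and the two-phase fill of a new entry can be modeled roughly as follows. This is a software sketch only; the class name `StoreBufferEntry` and the helpers `allocate_entry` and `write_data` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class StoreBufferEntry:
    address: int = 0        # word address: byte address with the two LSBs removed
    mask: int = 0           # 4 bits, one per data byte; "1" marks a valid byte
    data: int = 0           # 32-bit data word
    valid: bool = False     # set while the entry is in use
    matched: bool = False   # set while a merge into this entry is pending


def allocate_entry(byte_address: int, mask: int) -> StoreBufferEntry:
    """Fill address and mask, set the valid bit, clear the matched bit."""
    return StoreBufferEntry(address=byte_address >> 2, mask=mask,
                            valid=True, matched=False)


def write_data(entry: StoreBufferEntry, data: int) -> None:
    """Later pipeline stage: the store data is ready and is written into the entry."""
    entry.data = data & 0xFFFFFFFF
```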

In the present embodiment, the three bit fields are all configured such that bit “1” represents a set state and bit “0” represents a cleared state. In alternative embodiments of the present invention, it is also possible that bit “0” represents a set state and bit “1” represents a cleared state. In the present embodiment, each entry of the store buffer 130 can record data of at most thirty-two bits. However, the present invention should not be limited to the embodiments described herein, and the data field length of the entry can be modified depending upon actual requirements in alternative embodiments. For example, if data sixty-four bits long, i.e., a double word, is to be stored, the address field of each entry can be modified such that three least significant bits are removed from the complete address, the mask field can be lengthened to eight bits, and the data length can be lengthened to eight bytes.
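The relation between the data field length, the number of address bits removed, and the mask length can be written down as a small helper. This is only a sketch; `entry_geometry` is a hypothetical name, and the restriction to power-of-two lengths is an assumption.

```python
def entry_geometry(data_bytes: int) -> tuple[int, int]:
    """Return (address_bits_removed, mask_bits) for an entry of data_bytes bytes.

    Assumption: the data field length is a power of two, as in the
    four-byte and eight-byte examples of the description.
    """
    if data_bytes <= 0 or data_bytes & (data_bytes - 1):
        raise ValueError("data_bytes must be a power of two")
    address_bits_removed = data_bytes.bit_length() - 1   # log2(data_bytes)
    mask_bits = data_bytes                               # one mask bit per byte
    return address_bits_removed, mask_bits
```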

Referring back to FIG. 2, in the checking step 220, if the address of the store operation is the same as the address of an existing entry of the store buffer 130, the processor 100 proceeds to steps 230-245 such that the data of the store operation is merged into the existing entry, as described below in detail.

Firstly, the processor 100 sets a matched bit of the existing entry (step 230), which indicates that data merging is going to be performed, and generates a mask of the store operation according to the address and data type of the store operation in the manner illustrated in FIG. 4 (step 235). In a later stage of the instruction pipeline 105, when the data has been prepared, the processor 100 merges the data of the store operation into the existing entry (step 240), and clears the matched bit (step 245), which indicates that data merging has been completed.

An exemplary step 240 is illustrated in detail in FIG. 5 and FIG. 6, where the masks are binary numbers and the data are hexadecimal numbers. During data merging, the processor 100 performs a logic OR operation on the mask of the store operation and the mask of the existing entry to generate a merged mask, as shown in FIG. 5. In addition, the processor 100 assembles the data of the store operation and the data of the existing entry to generate merged data. An exemplary way of assembling the data is shown in FIG. 6. In this example, the data of the store operation takes priority, and bit “1” in the store operation mask represents that the corresponding bytes of the store operation data are valid data. Therefore, the processor 100 adopts the data of the store operation for the bytes corresponding to bit “1”, and adopts the data of the existing entry for the bytes corresponding to bit “0”, which results in the merged data.
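The merging of step 240 can be sketched in Python as follows. The values of FIG. 5 and FIG. 6 are not reproduced in the text, so the operands in the usage note are made up for illustration. The function name `merge` is hypothetical, and the sketch assumes that mask bit i corresponds to bits 8i to 8i+7 of the data word.

```python
def merge(store_mask: int, store_data: int,
          entry_mask: int, entry_data: int) -> tuple[int, int]:
    """Merge a store into an existing entry (bit "1" marks a valid byte)."""
    merged_mask = store_mask | entry_mask        # logic OR of the two masks
    merged_data = 0
    for i in range(4):
        byte_select = 0xFF << (8 * i)            # selects byte i of the word
        # The store operation takes priority wherever its mask bit is "1".
        source = store_data if (store_mask >> i) & 1 else entry_data
        merged_data |= source & byte_select
    return merged_mask, merged_data
```

For example, merging a store with mask 0011 and data 0x0000BEEF into an entry with mask 0110 and data 0x00AABB00 gives mask 0111 and data 0x00AABEEF under these assumptions.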

In an alternative embodiment of the present invention, in the store operation mask and existing entry mask, bit “0” represents that the corresponding data bytes are valid data. In this case, the processor 100 performs a logic AND operation on the store operation mask and the existing entry mask to generate a merged mask. As to the data merging, the processor 100 adopts the store operation data for the bytes corresponding to bit “0” in the store operation mask and adopts the existing entry data for the bytes corresponding to bit “1”.
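A sketch of this alternative polarity is given below, again with a hypothetical function name (`merge_active_low`) and the assumption that mask bit i corresponds to byte i of the data word. Only the mask operation and the selection condition differ from the bit “1” embodiment.

```python
def merge_active_low(store_mask: int, store_data: int,
                     entry_mask: int, entry_data: int) -> tuple[int, int]:
    """Merge when bit "0" marks a valid byte."""
    merged_mask = store_mask & entry_mask        # logic AND of the two masks
    merged_data = 0
    for i in range(4):
        byte_select = 0xFF << (8 * i)            # selects byte i of the word
        # Store data is adopted where the store mask bit is "0".
        source = entry_data if (store_mask >> i) & 1 else store_data
        merged_data |= source & byte_select
    return merged_mask, merged_data
```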

Next, the processor 100 stores the merged mask and merged data in the existing entry. As such, the store buffer 130 has at most one entry for data at a same address, which is different from the conventional technology in which data at a same address are distributed in multiple entries and data of multiple entries need to be merged during the forwarding operation.
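The store path as a whole (address comparison, then either a new entry or a merge) can be modeled as below. A Python dictionary keyed by word address stands in for the per-entry address comparison of steps 215 and 220; this is a modeling shortcut, not the hardware structure, and the name `execute_store` is hypothetical. The valid and matched bits are omitted for brevity.

```python
def execute_store(buffer: dict, byte_address: int, mask: int, data: int) -> None:
    """Add a new entry or merge into the existing one (bit "1" marks a valid byte).

    'buffer' maps a word address to a (mask, data) pair.
    """
    word_address = byte_address >> 2             # two LSBs removed
    if word_address not in buffer:               # no entry with the same address
        buffer[word_address] = (mask, data)
        return
    old_mask, old_data = buffer[word_address]
    merged = 0
    for i in range(4):
        byte_select = 0xFF << (8 * i)
        source = data if (mask >> i) & 1 else old_data
        merged |= source & byte_select
    buffer[word_address] = (mask | old_mask, merged)
```

Two stores to different bytes of the same word therefore leave a single entry in the model, which is the property the paragraph above describes.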

After writing the store operation data into the store buffer 130 or merging the store operation data into the existing entry of the store buffer 130, the processor 100 writes the data of the entry with the longest history in the store buffer 130 into the data cache 140 if no read/write competition occurs in the data cache 140 (step 250). For example, the store buffer 130 can be configured as a first-in first-out queue such that the entry at the head of the store buffer 130 has the longest history. After writing the data into the data cache 140, the processor 100 clears the valid bit of the entry having the longest history to release the storage space of the entry. Then, the process flow ends. The flow in which the processor 100 executes the store operation has been described above.

On the other hand, in the determination step 210, if the new memory operation is a load operation, the processor 100 compares the address of the load operation with the address of each entry in the store buffer 130 (step 255) to check if the address of any entry is the same as the address of the load operation (step 260). If there is no same address, the processor 100 directly reads data from the data cache 140 and allows the load operation to use the read data (step 275). Then, the process flow ends.
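The write-back of the entry with the longest history (step 250) can be modeled with a first-in first-out queue, as in the sketch below. The function name `drain_oldest` and the `cache_busy` flag are hypothetical. Writing only the bytes marked valid by the mask into the cache is an assumption of this sketch; the text does not spell out how the mask is used during write-back. Removing the entry from the queue stands in for clearing its valid bit.

```python
from collections import deque


def drain_oldest(store_buffer: deque, data_cache: dict, cache_busy: bool) -> bool:
    """Write the oldest entry into the cache if there is no read/write competition.

    'store_buffer' holds (word_address, mask, data) tuples, oldest at the left.
    'data_cache' maps a word address to a 32-bit word.
    """
    if cache_busy or not store_buffer:
        return False
    address, mask, data = store_buffer.popleft()   # releases the entry
    word = data_cache.get(address, 0)
    for i in range(4):
        if (mask >> i) & 1:                        # assumed: only valid bytes are written
            byte_select = 0xFF << (8 * i)
            word = (word & ~byte_select) | (data & byte_select)
    data_cache[address] = word
    return True
```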

If the store buffer 130 has an existing entry which has a same address as the address of the load operation, the processor 100 proceeds to compare the load operation mask with the existing entry mask to check if the two masks overlap, i.e., whether they have a bit “1” in common (step 270). The load operation mask is likewise generated in the manner shown in FIG. 4. If the two masks do not overlap, the processor 100 likewise directly reads data from the data cache 140 and allows the load operation to use the read data (step 275). Then, the process flow ends.

Conversely, if the address of the existing entry is the same as the address of the load operation and the existing entry mask overlaps with the load operation mask, there is a memory dependency between the store buffer 130 and the load operation, and all or a part of the data required by the load operation must be provided by the store buffer 130. Next, the processor 100 checks if the existing entry contains the complete data required by the load operation (step 280). If yes, the processor 100 reads the complete data from the data field of the existing entry and forwards the complete data to the load operation for use (step 290). Then, the process flow ends.

If the existing entry contains only a part of the data required by the load operation instead of the complete data, the processor 100 assembles the data of the existing entry and data of a corresponding entry in the data cache 140 that has the same address to generate the complete data required by the load operation, and forwards the complete data to the load operation for use (step 285). Then, the process flow ends.
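The decisions of steps 255 to 290 can be summarized in a short sketch. The function name `load_source` and the returned labels are hypothetical. The completeness test of step 280 is modeled here as "the entry mask covers every bit of the load mask", which is an assumption consistent with the mask definition but not stated verbatim in the text.

```python
def load_source(buffer_masks: dict, byte_address: int, load_mask: int) -> str:
    """Decide where a load obtains its data (bit "1" marks a valid byte).

    'buffer_masks' maps a word address to the mask of the store buffer entry.
    Returns "cache", "forward" or "assemble".
    """
    entry_mask = buffer_masks.get(byte_address >> 2)
    if entry_mask is None:                       # step 260: no entry with the same address
        return "cache"                           # step 275
    if entry_mask & load_mask == 0:              # step 270: masks do not overlap
        return "cache"                           # step 275
    if entry_mask & load_mask == load_mask:      # step 280: entry holds every required byte
        return "forward"                         # step 290
    return "assemble"                            # step 285
```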

The data are assembled in an exemplary manner as shown in FIG. 7, where the masks are binary numbers and the data are hexadecimal numbers. In this exemplary manner of data assembling, the existing entry of the store buffer takes priority, and bit “1” in the existing entry mask represents that the corresponding data bytes of the existing entry are valid data. Therefore, the processor 100 adopts the data of the existing entry for the bytes corresponding to bit “1”, and adopts the data of the corresponding entry of the data cache 140 for the bytes corresponding to bit “0”, which results in the complete data to be forwarded to the load operation.
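The byte selection performed in step 285 can be sketched as follows. The values of FIG. 7 are not reproduced in the text, so the operands in the usage note are made up. The function name `assemble` is hypothetical, and mask bit i is assumed to correspond to byte i of the data word.

```python
def assemble(entry_mask: int, entry_data: int, cache_data: int) -> int:
    """Combine a store buffer entry with the cache word at the same address."""
    complete = 0
    for i in range(4):
        byte_select = 0xFF << (8 * i)            # selects byte i of the word
        # The store buffer entry takes priority wherever its mask bit is "1".
        source = entry_data if (entry_mask >> i) & 1 else cache_data
        complete |= source & byte_select
    return complete
```

For example, an entry with mask 0011 and data 0x0000BEEF combined with the cache word 0x11223344 gives 0x1122BEEF under these assumptions.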

In an alternative embodiment of the present invention, in the mask of the existing entry of the store buffer 130, bit “0” represents that the corresponding data bytes are valid data. In this case, the processor 100 adopts the existing entry data for the bytes corresponding to bit “0” in the existing entry mask, and adopts the data of the corresponding entry of the data cache 140 for the bytes corresponding to bit “1”.

If there were no data assembling step 285, the data of the existing entry would have to be written into the data cache 140 before the complete data could be read out from the data cache 140. The data assembling of the present embodiment at least eliminates the time of writing the existing entry data into the data cache 140.

In summary, in the present invention, the data of the store operation is merged into the existing entry of the store buffer such that data of the same address are contained in at most one entry, which saves the storage space of the store buffer and reduces the complexity of forwarding data from the store buffer to the load operation. The present invention can directly assemble the data in the store buffer and the data cache and forward the assembled data to the load operation, which eliminates the time of writing the data from the store buffer to the data cache, thus enhancing the efficiency of the processor.

It will be apparent to those skilled in the art that various modifications and variations can be made to the structure of the present invention without departing from the scope or spirit of the invention. In view of the foregoing, it is intended that the present invention cover modifications and variations of this invention provided they fall within the scope of the following claims and their equivalents.

1. A method for executing a store operation, comprising: providing a store buffer; when executing a store operation, adding a new entry in the store buffer according to the store operation if the store buffer has no entry which has a same address as an address of the store operation; and merging data of the store operation into an existing entry of the store buffer if the address of the store operation is the same as an address of the existing entry.
2. The method for executing the store operation according to claim 1, wherein the new entry includes the address, a mask and the data of the store operation.
3. The method for executing the store operation according to claim 1, wherein merging the data of the store operation into the existing entry includes: generating a mask of the store operation according to the address and a data type of the store operation; generating a merged mask according to the mask of the store operation and a mask of the existing entry; generating merged data according to the mask and data of the store operation and data of the existing entry; and storing the merged mask and the merged data into the existing entry.
4. The method for executing the store operation according to claim 3, wherein the merged mask is generated based on a logic operation on the mask of the store operation and the mask of the existing entry; each bit of the mask of the store operation is a first preset value or a second preset value, a portion of the merged data that corresponds to the first preset value adopts the data of the store operation, and a portion of the merged data that corresponds to the second preset value adopts the data of the existing entry.
5. The method for executing the store operation according to claim 1, further comprising: providing a data cache; and writing data of an entry having a longest history in the store buffer into the data cache if no read/write competition occurs in the data cache.
6. A method for executing a load operation, comprising: providing a data cache and a store buffer; when executing a load operation, reading data required by the load operation from the data cache if there is no memory dependency between all entries of the store buffer and the load operation, the memory dependency being such that an address of any of the entries is the same as an address of the load operation and a mask of said entry overlaps with a mask of the load operation; providing complete data required by the load operation by an existing entry of the store buffer if there is memory dependency between the existing entry and the load operation and the existing entry contains the complete data required by the load operation; and generating the complete data required by the load operation according to data of the existing entry of the store buffer and data of a corresponding entry of the data cache if there is memory dependency between the existing entry and the load operation and the existing entry does not contain the complete data.
7. The method for executing the load operation according to claim 6, wherein an address of the existing entry is the same as an address of the corresponding entry.
8. The method for executing the load operation according to claim 6, wherein the complete data is generated based on a mask and the data of the existing entry and the data of the corresponding entry.
9. The method for executing the load operation according to claim 8, wherein each bit of the mask of the existing entry is a first preset value or a second preset value, a portion of the complete data that corresponds to the first preset value adopts the data of the existing entry, and a portion of the complete data that corresponds to the second preset value adopts the data of the corresponding entry.
10. A processor comprising: a data cache configured to store data read from a memory; and a store buffer coupled to the data cache and configured for temporary storage of an address and data of a store operation when a load operation and the store operation compete to access the data cache; wherein the processor adds a new entry in the store buffer according to the store operation if the store buffer has no entry which has a same address as the address of the store operation; and the processor merges the data of the store operation into an existing entry of the store buffer if the address of the store operation is the same as an address of the existing entry.
11. The processor according to claim 10, wherein the new entry includes the address, a mask and the data of the store operation.
12. The processor according to claim 10, wherein when merging the data of the store operation into the existing entry, the processor generates a mask of the store operation according to the address and a data type of the store operation, generates a merged mask according to the mask of the store operation and a mask of the existing entry, generates merged data according to the mask and the data of the store operation and data of the existing entry, and stores the merged mask and the merged data into the existing entry.
13. The processor according to claim 12, wherein the merged mask is generated based on a logic operation on the mask of the store operation and the mask of the existing entry; each bit of the mask of the store operation is a first preset value or a second preset value, a portion of the merged data that corresponds to the first preset value adopts the data of the store operation, and a portion of the merged data that corresponds to the second preset value adopts the data of the existing entry.
14. The processor according to claim 10, wherein the processor writes data of an entry having a longest history in the store buffer into the data cache if no read/write competition occurs in the data cache.
15. A processor comprising: a data cache configured to store data read from a memory; and a store buffer coupled to the data cache and configured for temporary storage of an address and data of a store operation when a load operation and the store operation compete to access the data cache; wherein the processor reads data required by the load operation from the data cache if there is no memory dependency between all entries of the store buffer and the load operation, the memory dependency being such that an address of any of the entries is the same as an address of the load operation and a mask of said entry overlaps with a mask of the load operation; the processor reads complete data required by the load operation from an existing entry of the store buffer if there is memory dependency between the existing entry and the load operation and the existing entry contains the complete data required by the load operation; and the processor generates the complete data required by the load operation according to data of the existing entry of the store buffer and data of a corresponding entry of the data cache if there is memory dependency between the existing entry and the load operation and the existing entry does not contain the complete data.
16. The processor according to claim 15, wherein an address of the existing entry is the same as an address of the corresponding entry.
17. The processor according to claim 15, wherein the complete data is generated based on a mask and the data of the existing entry and the data of the corresponding entry.
18. The processor according to claim 17, wherein each bit of the mask of the existing entry is a first preset value or a second preset value, a portion of the complete data that corresponds to the first preset value adopts the data of the existing entry, and a portion of the complete data that corresponds to the second preset value adopts the data of the corresponding entry.